Abstract: A system is presented that segments, clusters and
predicts musical audio in an unsupervised manner, adjusting the 
number of (timbre) clusters instantaneously to the audio input. A 
sequence learning algorithm adapts its structure to a dynamically 
changing clustering tree. The flow of the system is as follows: 1) 
segmentation by onset detection, 2) timbre representation of each 
segment by Mel frequency cepstrum coefficients, 3) discretization 
by incremental clustering, yielding a tree of different sound 
classes (e.g. instruments) that can grow or shrink on the fly driven 
by the instantaneous sound events, resulting in a discrete symbol 
sequence, 4) extraction of statistical regularities of the symbol 
sequence, using hierarchical N-grams and the newly introduced 
conceptual Boltzmann machine, and 5) prediction of the next 
sound event in the sequence. The system’s robustness is assessed 
with respect to complexity and noisiness of the signal. Clustering 
in isolation yields an adjusted Rand index (ARI) of 82.7% / 
85.7% for data sets of singing voice and drums. Onset detection
jointly with clustering achieves an ARI of 81.3% / 76.3% and the
prediction of the entire system yields an ARI of 27.2% / 39.2%.

Index Terms: Music information retrieval, unsupervised learning,
adaptive algorithms, prediction algorithms

I. Introduction 

Human music listening adapts to novel acoustic stimuli and 
is largely based on unsupervised learning, in contrast to most 
traditional music analysis systems. For music transcription,
prediction [8], [36], representation [22], [32], automatic
accompaniment, or human-machine improvisation, a traditional
system is usually based either on symbolic data instead of
audio input, or on classifiers that are pre-trained on a labelled
database [9]. If a system based on pre-trained classifiers needs
to cope with new musical concepts (instruments, harmonies, 
pitches, motifs) it has not been designed for, it may cease to 
work reasonably. Such a system would have to be retrained 
with labeled data, every time a new instrument (pitch, harmony 
etc.) appears. This presents a severe lack of flexibility of 
such a system, in contrast to human cognition processing new 
instruments and harmonies with ease, even if one has not 
heard them before. A human mind can grasp a novel motif, 
when listening to a piece or an improvisation. Unsupervised
learning (clustering) instead of supervised classification is one
paradigm by which an algorithm can model the cognition of novel
concepts [17], [24], [26]. Based on a discrete representation of the
input derived by clustering, an n-gram, i.e. a suffix tree, can
be used as a statistical representation of the structure of the
input sequence [8], [36]. In this paper, we extend such a system
by equipping it with the capability to deal with a varying 
number of clusters. The number of clusters can increase if 
a new instrument appears. The cluster number decreases if
two instruments come to sound very similar. We implement
these features by using unsupervised online learning. This 
requires that the n-gram (suffix tree) must be coupled with 
the clustering in order to be able to merge or split the symbol 
counts when cluster numbers change. We introduce a system 
prototype that learns in an unsupervised, adaptive manner and 
that generates predictions from audio sequences. From the first 
note it will begin to generate reasonable predictions without 
using previous knowledge. 

Many previous approaches to predicting musical sequences
are based on symbolic representation [2], [8], [22], [32], [34], [36].
Paiement et al. present a model that is capable of
predicting and generating melodies using a combination of 
Bayesian networks, clustering, rhythmic self-similarity and 
a special representation of melody. The distances between 
rhythmical patterns are clustered and the continuation of a 
melody is predicted conditioned on the chord root, chord type, 
and Narmour group of recent melodic notes. Hazan et al.
build a system for generation of musical expectation
that operates on music in audio data format. The auditory
front-end segments the musical stream and extracts both 
timbre and timing description. In an initial bootstrap phase, an 
unsupervised clustering process builds up and maintains a set 
of different sound classes. The resulting sequence of symbols 
is then processed by a multi-scale technique based on n-grams. 
Model selection is performed during a bootstrap phase via 
the Akaike information criterion. Marchini and Purwins [24]
present a non-adaptive system that learns rhythmic patterns
from drum audio recordings and synthesizes music variations
from the learned sequence. The procedure uses a fuzzy multi-level
representation. Moreover, a tempo estimation procedure
is used to guarantee that the metrical structure is preserved in
the generated sequence. Online clustering has been proposed
by Zhang et al. [46] for document clustering. Bertin-Mahieux
et al. have used online k-means to cluster beat-chroma
patterns. The Hierarchical Dirichlet Process Hidden Markov
Model (HDP-HMM) [43] has been used for segmentation in
conjunction with clustering. Fox et al. and Ren et al. [40]
have proposed 'sticky' versions of the HDP-HMM that introduce
explicit modelling of state occupancy duration. These
models are applied to segmentation of a Beethoven sonata into
musical sections [40] and to speaker diarization. Stepleton
et al. [42] used the block diagonal infinite hidden Markov
model for musical theme labelling. However, these methods do
not perform incremental online learning, whereas we propose
an online incremental clustering method that uses a separate
segmentation method (onset detection) and switches relatively
rapidly between states. Bargi et al. [3] have adapted the
HDP-HMM to an online setting employing an initial supervised
learning phase (bootstrap) whereas our approach is entirely 
unsupervised. 

A part of the work covered in this paper, the application
of the hierarchical n-grams on the Voice data, has been
presented previously [26]. Here we compare that method
with the conceptual Boltzmann machine and with HDP-HMM
on an extended data set using a more advanced evaluation
measure (the adjusted Rand index) and providing more examples
of adaptive clustering. We first give an overview of
the system and introduce its components, namely segmentation,
timbre representation, clustering, and prediction. We then
introduce the adjusted Rand index and test the performance of
the sequence analysis algorithms under noisy conditions, of
each system module separately, and of the modules in conjunction.
Finally, we give some demonstration examples. Audio-visual data
and examples are available on the supporting website [27].

II. System Overview 

The system that we present in this paper (cf. Fig. 1) consists
of four main stages: segmentation by onset detection, feature
extraction resulting in timbre representation, incremental
clustering giving a symbol sequence, and sequence analysis 
yielding a prediction of the next symbol. In particular, the 
clustering tree generated by incremental clustering grows and 
shrinks online, driven by the most recent sounds. In turn, the 
sequence model adapts to the changing numbers of symbols. 
Segmentation and representation can be interpreted as a model 
of perception, whereas discretization and prediction can be 
considered to be a cognitive model.

Fig. 1. System architecture: An audio sound file is segmented using onset
detection. Each segment is then represented as a high-dimensional timbre
feature vector which is clustered into symbols. Symbols are added to or removed
from the clustering tree on the fly. The symbol sequence is then statistically
analysed, adapting to the varying number of symbols, allowing for prediction
of the next symbol and the next inter-onset interval (IOI).


A. Segmentation by Onset Detection 

In this section, we explain how to segment an audio
stream into events using onset detection. In order to be more
generally applicable, we have employed the complex domain
based onset detector [10], since it subsumes onset detection
algorithms based on energy, spectral difference, or phase as
special cases. This onset detection function captures onsets
due to abrupt energy changes as well as soft onsets induced
by pitch changes with little energy variation. For each frame
$l$, the short-term Fourier transform yields a complex spectrum
$X_k(l) = r_k(l) e^{j\varphi_k(l)}$ with magnitude $r_k$ and phase $\varphi_k$ for the
$k$-th bin with frame length $K$ ($0 \le k \le K-1$). We build
the onset detection function as the Euclidean distance between
the actual complex spectrum $X_k(l)$ at bin $k$ and the estimated
complex spectrum [10]:

$$\hat{X}_k(l) = \hat{r}_k(l) e^{j\hat{\varphi}_k(l)}, \qquad (1)$$

where the estimated amplitude $\hat{r}_k(l)$ is set equal to the magnitude
of the previous frame $\|X_k(l-1)\|$, and the estimated
phase $\hat{\varphi}_k(l)$ is calculated as the linear extrapolation from the
unwrapped phases of the two preceding frames:

$$\hat{\varphi}_k(l) = \mathrm{princarg}\left[\varphi_k(l-1) + (\varphi_k(l-1) - \varphi_k(l-2))\right],$$

where $\varphi$ denotes the unwrapped phase and the princarg
operator maps the unwrapped value back to the $(-\pi, \pi]$ range.
We calculate the bin-wise Euclidean distance between the
actual and the estimated complex spectrum, quantifying the
stationarity for the $k$-th bin as $\Delta_k(l) = \|\hat{X}_k(l) - X_k(l)\|$. By
summing across all $K$ bins and across $M+1$ consecutive
frames centered around frame $l$ (smoothing), we obtain the
onset detection function:

$$\eta(l) = \sum_{j=-\lceil M/2 \rceil}^{\lfloor M/2 \rfloor} \sum_{k=0}^{K-1} \Delta_k(l+j). \qquad (2)$$

Similarly to previous approaches, an adaptive threshold
$\theta(l)$ is used. This threshold is calculated as the scaled median
across a look-ahead window of length $P+1$:

$$\theta(l) = C \cdot \mathrm{median}_{n \in \{l, l+1, \ldots, l+P\}}(\eta(n)), \qquad (3)$$

with $0 < C < 1$ being a predefined parameter controlling the
sensitivity of the onset detector. In order to eliminate multiple
occurrences of onsets shortly one after another, smoothing is
applied via another window of length $W+1$ centered at sample
$l$:

$$\mu(l) = \sum_{m=-\lceil W/2 \rceil}^{\lfloor W/2 \rfloor} \max(\eta(l+m) - \theta(l+m), 0). \qquad (4)$$

A silence threshold $\theta_s$ is applied:

$$\mu_s(l) = \max(\mu(l) - \theta_s, 0). \qquad (5)$$

Finally, the local maxima of $\mu_s(l)$ define the predicted onset
times.
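
The steps of this subsection can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names `princarg`, `complex_domain_odf` and `pick_onsets`, the default parameter values, and the simple local-maximum rule are assumptions made for the example. The input is assumed to be a precomputed complex STFT of shape (frames, bins).

```python
import numpy as np

def princarg(phase):
    """Map phase values into the (-pi, pi] range."""
    return np.pi - np.mod(np.pi - phase, 2.0 * np.pi)

def complex_domain_odf(spec, M=2):
    """Onset detection function from a complex STFT of shape (frames, bins).

    M is the smoothing parameter: M + 1 frames around each frame are summed.
    """
    mag, phase = np.abs(spec), np.angle(spec)
    n_frames = spec.shape[0]
    delta = np.zeros(n_frames)
    for l in range(2, n_frames):
        # Linear phase extrapolation. Wrapped phases suffice here because the
        # predicted spectrum only depends on the phase modulo 2*pi.
        phase_hat = princarg(2.0 * phase[l - 1] - phase[l - 2])
        spec_hat = mag[l - 1] * np.exp(1j * phase_hat)
        delta[l] = np.sum(np.abs(spec[l] - spec_hat))  # sum over all bins
    lo, hi = int(np.ceil(M / 2)), M // 2
    return np.array([delta[max(0, l - lo):l + hi + 1].sum()
                     for l in range(n_frames)])

def pick_onsets(eta, C=0.5, P=10, W=2, theta_s=0.0):
    """Return the frame indices of predicted onsets."""
    n = len(eta)
    # Adaptive threshold: scaled median over a look-ahead window of P + 1 frames.
    theta = np.array([C * np.median(eta[l:l + P + 1]) for l in range(n)])
    excess = np.maximum(eta - theta, 0.0)
    lo, hi = int(np.ceil(W / 2)), W // 2
    mu = np.array([excess[max(0, l - lo):l + hi + 1].sum() for l in range(n)])
    mu_s = np.maximum(mu - theta_s, 0.0)  # silence threshold
    # Hypothetical peak rule: strictly above the left neighbour, not below the right.
    return [l for l in range(1, n - 1)
            if mu_s[l] > 0 and mu_s[l] > mu_s[l - 1] and mu_s[l] >= mu_s[l + 1]]
```

A stationary sinusoid gives a detection function near zero, because both magnitude and phase advance are predicted correctly. An abrupt amplitude change produces a peak at the frame where the prediction fails.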

B. Feature Extraction for Timbre Representation 

For each onset, a short window of length $L$ subsequent to the
onset time is analyzed. For each frame within this window, the
first 13 Mel-Frequency Cepstrum Coefficients (MFCC) [31]
are calculated. To model the coefficients' temporal behaviour
right after the onset, for each coefficient another Discrete
Cosine Transform (DCT) is calculated on the sequence of
coefficients across the frames. Taking the first 4 DCT coefficients
for each MFCC yields a 52-dimensional vector,
representing timbral features both of the sound event's spectral
characteristics and their initial temporal development.
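
As a sketch of how the 52-dimensional vector could be assembled, the following assumes that the MFCCs have already been computed by some front end and are given as an array of shape (frames, 13). The helper names `dct2` and `timbre_vector` are hypothetical, and the unnormalised DCT-II is an assumption, since the text does not specify the DCT normalisation.

```python
import numpy as np

def dct2(x, n_coeffs):
    """Unnormalised DCT-II along axis 0, keeping the first n_coeffs coefficients."""
    n = x.shape[0]
    k = np.arange(n_coeffs)[:, None]
    t = np.arange(n)[None, :]
    basis = np.cos(np.pi * (t + 0.5) * k / n)  # shape (n_coeffs, n)
    return basis @ x

def timbre_vector(mfcc, n_dct=4):
    """Turn an MFCC matrix (frames, 13) into a 13 * n_dct dimensional vector."""
    coeffs = dct2(np.asarray(mfcc, dtype=float), n_dct)  # shape (n_dct, 13)
    # Group the output by MFCC index: n_dct temporal coefficients per MFCC.
    return coeffs.T.reshape(-1)
```

For each MFCC, the zeroth DCT coefficient summarises the average value over the window, while the higher coefficients describe how that MFCC evolves right after the onset.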


C. Incremental Clustering for Symbol Sequence Generation 

The clustering stage receives multivariate feature vectors 
from the preprocessing stage and converts them into symbols. 
It is important to state that in our system the events are 
clustered in an online manner and in order of arrival, since 
this symbolic representation is used immediately to create 
predictions of future events. As a reference and benchmark, 
we compare online clustering by Cobweb with a state-of-the- 
art batch clustering method exploiting sequential information, 
the HDP-HMM. 

1) Cobweb: For this purpose, Marxer et al. [28] used
Cobweb [12]. Cobweb is an incremental clustering model
which continuously builds a knowledge tree (hierarchical
partitioning of the object space) and assigns to each instance
a partition created at each level until the object reaches the
leaves of the tree. Each node of the tree represents a concept. A
concept is modelled by a univariate Gaussian for each feature
dimension. The edges of the structure represent taxonomic
relations. Further works [29], [45] have proposed techniques
to create, in an unsupervised manner, the concept tree based
on the sequence of data presented, by the use of a heuristic
function to be maximized. The heuristic function used in
this paper is the numerical version of the standard category
utility function used by Fisher and introduced by Gluck and
Corter [16]. The version of Cobweb that we will use was
presented as Cobweb/3 [29] and later extended as Cobweb/95.
This algorithm clusters $D$-dimensional feature vectors
$x = (x_1, \ldots, x_D)$ extracted in the previous section. Consider
a particular cluster containing $I$ feature vectors. Let $\sigma_d$ be the
standard deviation in component $d$ of the input feature vectors
assigned to that cluster. Then $\sum_{d=1}^{D} \frac{1}{\sigma_d}$ is the specificity of
that cluster across all feature dimensions. We consider the
utility $U$ to quantify the gain in specificity by splitting this
cluster into $K$ child clusters. For a potential child cluster
$1 \le k \le K$ with $I_k$ instances and each input feature dimension
$d$ we define $\sigma_{dk}$ to be the inner cluster standard deviation in
that dimension. Then $\sum_{d=1}^{D} \frac{1}{\sigma_{dk}}$ is the specificity of cluster $k$,
and $\sum_{k=1}^{K} \frac{I_k}{I} \sum_{d=1}^{D} \frac{1}{\sigma_{dk}}$ is the specificity of the child clusters
altogether. For the cluster utility holds

$$U \propto \frac{1}{K} \left( \sum_{k=1}^{K} \frac{I_k}{I} \sum_{d=1}^{D} \frac{1}{\max(\sigma_{dk}, a)} - \sum_{d=1}^{D} \frac{1}{\max(\sigma_d, a)} \right). \qquad (6)$$


The acuity parameter $a$ is a lower bound on the standard deviation
and hence an upper limit on the specificity of the clusters, thereby
controlling the maximal resolution of the clustering discrimination.
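
The role of the acuity can be illustrated with a short sketch of an acuity-limited utility in the style of Cobweb/3. It assumes that each child cluster is weighted by its share of instances $I_k / I$, as in the usual Cobweb/3 category utility. The function names `specificity` and `utility` and the default acuity value are hypothetical.

```python
import numpy as np

def specificity(points, acuity):
    """Sum over dimensions of 1 / max(sigma_d, acuity) for one cluster."""
    sigma = np.std(np.asarray(points, dtype=float), axis=0)
    return float(np.sum(1.0 / np.maximum(sigma, acuity)))

def utility(children, acuity=0.1):
    """Gain in specificity when a parent cluster is split into the given children.

    children: list of arrays, each of shape (I_k, D).
    """
    parent = np.vstack(children)
    n_total, n_children = parent.shape[0], len(children)
    # Assumed weighting: each child contributes in proportion to its size.
    child_spec = sum(len(c) / n_total * specificity(c, acuity) for c in children)
    return (child_spec - specificity(parent, acuity)) / n_children
```

Without the acuity, a cluster of identical points would have zero standard deviation and infinite specificity. The `max` with the acuity keeps the utility finite.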

The incorporation of an object is a process of clustering 
the object by descending the tree along an appropriate path, 
updating counts along the way, and possibly performing one 
of several operations at each level. These operators are: 

• creating a new node,
• removing all children from a node (pruning),
• combining two clusters into a single node, and
• splitting a node into several nodes.

While these operations are applied to a single object set 
partition (i.e., set of siblings in the tree), compositions of 
these primitive operations transform a single clustering tree. 
As a search strategy we use hill-climbing through a space of 
clustering trees. 

Thereby, the input is converted into a sequence not only of 
symbols, but also of meta symbols (partitions) according to 
their parent nodes and grandparent nodes in the Cobweb tree.
The symbols and meta symbols provide the alphabet on which 
expectations will be generated by the hierarchical N-gram. 

We modify the set of possible Cobweb operations (see 
above) in order to achieve persistent partitioning. This reduced 
set of operations can perform any of Cobweb’s original 
operations. We reformulate the second Cobweb operation 
(see above) in order to control the clustering only by new 
incoming events. Other partitions and past events should not 
be considered. This reduces the operations to: 

• creating a new partition inside a container partition, 

• removing a partition, reparenting its children if it has
any.

2) Hierarchical Dirichlet Process Hidden Markov Model
(HDP-HMM): The feature vector sequence may also be modelled
as the emission of an HDP-HMM, a Bayesian nonparametric
model in which the hidden states can be considered
as clusters. Given the observed feature vector sequence, the
most likely hidden state sequence can be interpreted as a
sequence of symbols. In the HDP-HMM, the hidden states are
assumed to be drawn from a countably infinite state space.
The HDP-HMM is used to jointly estimate the number of
clusters, the cluster assignment of the feature vectors, and the
transition probabilities between clusters. Inference in the
HDP-HMM is performed using the weak limit approximation
implemented in pyhsmm [19] (see footnote 1). However, the inference
does not work in an online manner: it requires the entire feature
vector sequence as input. This method is offline (batch mode)
and is only used as a reference and benchmark, since it does
not fulfil the constraint of clustering the feature vectors as
they arrive to perform immediate prediction from the very
beginning.


D. Sequence Analysis for Next Symbol/Onset Prediction

We choose two methods (hierarchical N-grams and conceptual
Boltzmann machine) [37] that require relatively little
storage by deducing frequency counts for longer sequences
from frequency counts of their shorter subsequences. These
algorithms iteratively predict the next symbol (or the inter-onset
interval, IOI, respectively) $c_{t+1}$ based on previous
symbols (IOIs) $c_{t-n+1}, c_{t-n+2}, \ldots, c_t$, previously generated
by incremental clustering. Thereby we derive which sound
to expect when. The prediction of symbols and of IOIs
is performed independently. By predicting the IOI, we can
determine the onset time of symbol $c_{t+1}$.

1 http://github.com/mattjj/pyhsmm 





1) Hierarchical N-Grams (HN): N-grams have been used
in the analysis of genome sequences and in language modeling.
Exhaustive N-grams count the instances of all possible
symbol (IOI) sequences of length N. Their memory requirement
is exponential in the sequence length N and the problem
arises how to account for patterns that have not occurred before
(zero frequency problem). We use N-grams as an estimate for
the forward conditional distribution for online prediction of
the next symbol (IOI).
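
To make online next-symbol prediction concrete, here is a minimal sketch of a plain multi-width N-gram counter that backs off to shorter contexts when the current context has not been seen. This is a simplified stand-in, not the hierarchical N-gram of this paper; the class name `NGramPredictor` and the back-off rule are assumptions made for illustration.

```python
from collections import Counter, defaultdict

class NGramPredictor:
    """Online multi-width N-gram counts with back-off to shorter contexts."""

    def __init__(self, max_n=3):
        self.max_n = max_n
        # counts[context][next_symbol]; a context holds up to max_n - 1 symbols.
        self.counts = defaultdict(Counter)
        self.history = []

    def update(self, symbol):
        """Register the newly observed symbol for every available context width."""
        for width in range(self.max_n):
            if width <= len(self.history):
                context = tuple(self.history[len(self.history) - width:])
                self.counts[context][symbol] += 1
        self.history.append(symbol)

    def predict(self):
        """Return the most frequent continuation of the longest known context."""
        longest = min(self.max_n - 1, len(self.history))
        for width in range(longest, -1, -1):
            context = tuple(self.history[len(self.history) - width:])
            if context in self.counts:
                return self.counts[context].most_common(1)[0][0]
        return None  # nothing observed yet
```

Backing off to a shorter context is one simple answer to the zero frequency problem: an unseen long pattern is predicted from the statistics of its shorter suffixes.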

Hierarchical N-grams (HN) need less memory than
exhaustive N-grams. HN are a combination of sparse N-gram
models in a hierarchical structure that allows compositional
learning. Compositional learning consists in learning long
patterns from already learned sub-patterns. In sparse N-grams,
counts of the most frequent patterns and a separate total count
for the non-frequent patterns are kept. This technique separates
the estimates of patterns whose statistics are reliable from the
estimates of infrequent patterns whose statistics are biased. On
the other hand, the multi-width exhaustive approach consists in
keeping the count of all possible patterns of at most length N.
These models are able to represent any distribution of patterns
up to width N.

Let $C_1 = \{c^1, \ldots, c^{|C_1|}\}$ be the set of cluster indices,
renumbered so that they reflect the order of their first appearance
in the symbol sequence $c = (c_1, \ldots, c_t)$ obtained from
the clustering process in the previous section. $C_1$ forms the
alphabet of the n-gram. Then $C_1^n$ is the set of all possible
n-grams of length $n$ composed from alphabet $C_1$. To exploit
sparsity, we only consider the patterns that have actually
occurred as a subsequence of $c$ so far until time $t$. The set
of patterns of length $n$ having occurred so far will be denoted
by $C_n = \{c^1, \ldots, c^{|C_n|}\}$, in which again the subpatterns are
ordered according to their first appearance. $o(c)$ is defined
as the position of $c$ in $C_{|c|}$. We consider HNs of maximal
length $N$. Let $C_{n,i}$ ($n \le N$) be the frequency count of the
$i$-th pattern of length $n$ and let $T_{n,i}$ be the total count of
patterns of length $n$ since pattern $i$ occurred for the first time.
In Algorithm 1 we use the counts $T_{n,i}$ and $C_{n,i}$ to iteratively
estimate the joint probabilities $P_{n,i}$ for all patterns seen so far.
We define $T_{n,0} := T_{n,1}$ for $1 \le n \le N$. In simple N-grams,
the empirical frequency $C_{n,i}/T_{n,i}$ could be used as an estimate for
the probability of a pattern of length $n$. In the HN method
(Eq. 12), the probability $P^{n-1}_{n,i}$ for the $i$-th pattern of length
$n$ under the joint distribution of width $n-1$ is estimated.
For pattern $c^i = (c^i_1, \ldots, c^i_n)$, statistical estimates (Eq. 8.2 in
Pfleger)


$$P^{n-1}_{n,i} = \frac{P(c^i_1, \ldots, c^i_{n-1}) \cdot P(c^i_2, \ldots, c^i_n)}{\sum_{y \in C_1} P(c^i_2, \ldots, c^i_{n-1}, y)} \qquad (7)$$


are calculated, using the sub-patterns of the $i$-th pattern of
length $n$. We estimate the probability $Q^{n-1}_{n,i}$ (Eq. 11) of
sub-patterns of length $n-1$ of the $i$-th pattern of length $n$ of
not being a subpattern of the first $i$ patterns of length $n$. In
Eq. 12 they are weighted by their confidence. The confidence
values depend on the number of occurrences of the patterns.
Therefore, when a pattern of length $n$ has appeared rarely
in the data stream, its probability of occurrence is estimated
from a small number of counts and is not reliable. In this
case the probability of appearance is better estimated from
the sub-patterns of length $n-1$ through $P^{n-1}_{n,i}$. In other words,
the information of patterns of large lengths is integrated with
the information of models of small lengths. Pfleger shows that
the probability of a given pattern can be calculated in a linear
sweep by updating all the probabilities in order of the pattern's
first occurrence and length.

Fig. 2. The effect of a concept merge in the hierarchical n-gram. Nodes c
and d are merged into the new symbol e. The n-gram transfers the counts for
patterns including c and d to patterns including e.

Fig. 3. Illustration of the continuous composition of symbols (atoms) in the
Boltzmann Machine into longer patterns (chunks).

In order to adapt Pfleger's HN to our architecture, we
have to link the operations of the clustering model to the
operations on the n-gram (Fig. 2). When two or more clusters
are merged in the clustering model, we have to remove the
superfluous clusters from the set of cluster indices (Eq. 8)
and to sum up the counts for the merged clusters (Eq. 9).
For example, if the n-gram tracks patterns bbc and bbd and
suddenly the clustering model merges symbols c and d into a
new symbol e, the n-gram must sum up the counts of bbc and
bbd and substitute them with the count of bbe.
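
The bbc/bbd example can be sketched as a relabelling of pattern counts. This is a minimal illustration that stores counts in a `Counter` keyed by pattern tuples, which is an assumption about the data structure. The function name `merge_symbols` is hypothetical, and the sketch does not maintain the index bookkeeping of Algorithm 1.

```python
from collections import Counter

def merge_symbols(counts, merged, new_symbol):
    """Relabel every merged symbol in all patterns and sum coinciding counts.

    counts:     Counter mapping pattern tuples to occurrence counts.
    merged:     iterable of symbols that the clustering has merged.
    new_symbol: symbol that replaces all merged symbols.
    """
    merged = set(merged)
    result = Counter()
    for pattern, count in counts.items():
        renamed = tuple(new_symbol if s in merged else s for s in pattern)
        result[renamed] += count  # patterns that become identical are summed
    return result

counts = Counter({("b", "b", "c"): 3, ("b", "b", "d"): 2, ("b", "c", "b"): 1})
merged_counts = merge_symbols(counts, merged=["c", "d"], new_symbol="e")
```

After the merge, the counts of bbc and bbd are pooled under bbe, so statistics gathered before the merge are not lost.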

Algorithm 1 The Hierarchical N-Gram for Merged Clusters
Initialization: $C_n = \{\}$ for $1 \le n \le N$
for incoming event $c_t$ do
    for $1 \le n \le N$ do
        if $(c_{t-n+1}, \ldots, c_t) \notin C_n$ then
            Add new pattern: $C_n = C_n \cup (c_{t-n+1}, \ldots, c_t)$, $T_{n,|C_n|} = 0$, $C_{n,|C_n|} = 0$
        end if
        if $c^1, \ldots, c^k \in C_n$ are merged by Cobweb then
            $o' = \min(o(c^1), \ldots, o(c^k))$    (8)
            $C_{n,o'} = \sum_{i=1}^{k} C_{n,o(c^i)}$    (9)
            $C_n = C_n \setminus \{c^1, \ldots, c^{o'-1}, c^{o'+1}, \ldots, c^k\}$    (10)
            Update indices
        end if
        Update counts: $C_{n,o(c_{t-n+1}, \ldots, c_t)} = C_{n,o(c_{t-n+1}, \ldots, c_t)} + 1$
        Update total counts: $T_{n,i} = T_{n,i} + 1$ for $1 \le i \le |C_n|$
    end for
    Calculate joint probabilities:
    $P^{0}_{1,i} = \frac{1}{|C_1|}$ ($1 \le i \le |C_1|$)
    for $1 \le n \le N$ do
        for $1 \le i \le |C_n|$ do
            $Q^{m}_{n,i} = 1 - \sum_{k=1}^{i} P^{m}_{n,k}$ ($m = n, n-1$)    (11)
            Calculate $P^{n-1}_{n,i}$ according to Eq. 7
            $P^{n}_{n,i} = [\ldots]$    (12)
        end for
    end for
end for

2) Conceptual Boltzmann Machine (CB): The Boltzmann
machine [1] is a stochastic, symmetric-recurrent neural network
that can be used to represent a joint distribution of
random variables, to complete patterns, and in particular (as
in our case) to predict the continuation of a time series.

Formally, a Boltzmann machine consists of a vector of
binary units $(s_1, \ldots, s_J) \in \{0,1\}^J$, symmetric weights
$w_{ij} \in \mathbb{R}$ between pairs of units $(s_i, s_j)$, an update rule for the
units and a learning rule for the weights.

Applied to categorical data [38], a $V$-valued symbol $c_u$ is
encoded as binary units $s_{u_1}, \ldots, s_{u_V}$ with $s_{u_v} = 1$ if and
only if $c_u = v$. To connect two $V$-valued variables $c_u$ and
$c_i$, $V^2$ weights $w_{i_j u_v}$ are needed to connect the binary units
representing the two variables. Initially, the architecture of our
particular Boltzmann machine implementation consists of sets
of binary variables for consecutive symbols, where the binary
nodes of each variable are initially only connected to the
binary nodes of the previous and the next symbol. Depending
on the other units and weights, the stochastic softmax update
rule for the symbol is:

$$P(c_i = j) = \frac{1}{1 + e^{-\frac{1}{T} \sum_{u,v} w_{i_j u_v} s_{u_v}}} \qquad (13)$$

with temperature $T$ decreasing from $T = 50$ to $T = 0.005$ in
100 steps. As an example of Gibbs sampling, this update rule
is applied iteratively. In general, through simulated annealing
of the temperature $T$, the states converge to a particular state
vector.

For training the Boltzmann machine, the weights have to
be learned. As in the case of the restricted Boltzmann machine,
in our case, not all pairs $(s_i, s_j)$ are connected by non-zero
weights $w_{ij}$. Units representing the same symbol are not
connected among each other. For each previous symbol
sequence, the corresponding binary units are fixed and the
update rule (13) is iteratively applied until the
final states are reached (denoted by $s_i^{+}$). In addition, the update
rule is applied with no units fixed until another vector of
final states ($s_i^{-}$) is reached. Then a stochastic gradient-based
learning step for the weights can be performed with learning
rate $\mu$ for a single training instance yielding $s^{+}$ and $s^{-}$:

$$\Delta w_{ij} = \mu (s_i^{+} s_j^{+} - s_i^{-} s_j^{-}). \qquad (14)$$

The learning step aims at minimizing the difference between
$s_i^{+} s_j^{+}$ and $s_i^{-} s_j^{-}$. $\mu = 0.1$ is used.
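
The annealing schedule and the learning step of Eq. 14 can be sketched as follows. The text only gives the start temperature, end temperature and number of steps, so the geometric spacing is an assumption. The function names `annealing_schedule` and `weight_update` and the connectivity mask `connected` are hypothetical names used for illustration.

```python
import numpy as np

def annealing_schedule(t_start=50.0, t_end=0.005, steps=100):
    """Temperatures from t_start down to t_end (geometric spacing assumed)."""
    return np.geomspace(t_start, t_end, steps)

def weight_update(w, s_plus, s_minus, connected, mu=0.1):
    """One learning step: w_ij += mu * (s_i+ s_j+ - s_i- s_j-).

    w:         symmetric weight matrix of shape (J, J).
    s_plus:    final binary states reached with the context units fixed.
    s_minus:   final binary states reached with no units fixed.
    connected: 0/1 mask of allowed connections (zero on the diagonal and
               between units that encode the same symbol).
    """
    s_plus = np.asarray(s_plus, dtype=float)
    s_minus = np.asarray(s_minus, dtype=float)
    delta = mu * (np.outer(s_plus, s_plus) - np.outer(s_minus, s_minus))
    return w + delta * connected  # only allowed connections are changed
```

Weights between units that are co-active in the clamped phase but not in the free phase are strengthened, which is what drives the later creation of hidden units for frequent symbol pairs.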
When weight $w_{ij}$ rises above a threshold $\theta_w$, a new hidden
unit is created, representing the concatenation of symbols
connected by strong weights (cf. Fig. 3) [38]. In addition,
weight $w_{ij}$ is removed. Iteratively, hidden units for patterns
of length $n+1$ are created from nodes representing patterns
of length $n$ and a new set of binary nodes representing patterns
of length $n$ is appended. We set $\theta_w = 0.2, 0.15, 0.1, 0.05$
respectively, depending on the length of the pattern the unit
represents (length 1, 2, 3, 4). This variant of the Boltzmann
machine is called the compositionally-constructive categorical
Boltzmann machine [38]. For predicting the next symbol $c_{t+1}$
in a trained Boltzmann machine, the respective units are fixed
to the previous symbol sequence $c_{t-n+1}, \ldots, c_t$. After running
the unit update rule (13) until convergence, the predicted
next symbol $c_{t+1}$ in the sequence can be retrieved from the
corresponding binary units of the Boltzmann machine.

In our system we have implemented a new method called
conceptual Boltzmann machine (CB). In Pfleger's approach, the
Boltzmann machine acts on a static set of categories. We have
extended this to an architecture which operates on a dynamically
changing taxonomy of categories. The model thus adjusts
to the tree structure generated by Cobweb. This means the
Boltzmann machine changes its architecture on the fly, guided
by the creation, removal, splitting, and merging operations
suggested by Cobweb. Accordingly, in the Boltzmann
machine, the units and the update rule must be adjusted to
the new structure.

During the run, sequences of atoms cause the creation of 
higher-level chunks that represent patterns. The newly created 
chunks that represent patterns are then further chunked into 
nodes that represent patterns of longer length. The longest 
pattern represented by a node is fixed to a value of N. 


III. Performance Analysis of the System 
A. Measures for Clustering Evaluation 

Unlike in supervised learning, where accuracy can be measured
between the annotated labels and the labels predicted by
a classifier, the number of clusters predicted by the analysis
can be different from the number of annotated label categories.
In addition, the mapping between annotated and predicted labels
is unclear. This creates the need for a particular clustering
evaluation measure. The following measures for evaluating
the agreement between annotations and predicted labels have
been suggested: purity [47], F-measure [21], Pearson's
chi-squared coefficient [44], and the Rand index [44]. We choose the
latter measure for evaluation, since it is a natural extension of
classifying elements to pairs of elements.

A partition (clustering) $C$ of a set $X$ is defined as a set
$C = \{C_1, \ldots, C_J\}$ of subsets $C_j \subset X$, so that $\bigcup_j C_j = X$ and
$C_j, C_{j'}$ are disjoint for $j \ne j'$. Let $\mathbb{P}$ be the set of all partitions
of $X$ and let $|C|$ be the number of elements in a partition
$C \in \mathbb{P}$. Let $A \in \mathbb{P}$ be a partition generated by annotation and
let $C \in \mathbb{P}$ be a predicted partition derived from an algorithm.
Let $\mathcal{V} = \{(x, x') \mid x, x' \in X, x \ne x'\}$ be the set of pairs of
distinct events. Let $\mathcal{L} \subset \mathcal{V}$ be the set of event pairs where
both $x$ and $x'$ share the same labels/annotation provided by $A$
and let $\mathcal{K} \subset \mathcal{V}$ be the set of event pairs where both $x$ and $x'$
lie in the same cluster provided by $C$. Then $|\mathcal{K} \cap \mathcal{L}|$ is the
number of point pairs that lie in the same cluster and, at the
same time, share the same annotated labels. For $|C| > 1$, the
Rand index is defined as:

$$R(A, C) = \frac{2\left(|\mathcal{K} \cap \mathcal{L}| + |(\mathcal{V} \setminus \mathcal{K}) \cap (\mathcal{V} \setminus \mathcal{L})|\right)}{|C|(|C|-1)}. \qquad (15)$$

Since $R$ depends on the number of clusters $|C|$, we adjust
the Rand index, comparing it with the expected value of $R$
(baseline of a random clustering) $E_R$. The expected value of
$R$ over all partition combinations $\mathbb{P} \times \mathbb{P}$ is calculated as [33]:

$$E_R = \frac{1}{|\mathbb{P}|^2} \sum_{A, C \in \mathbb{P}} R(A, C). \qquad (16)$$

$E_R$ becomes maximal for $A = C$:

$$E_{\max} = \frac{1}{|\mathbb{P}|} \sum_{C \in \mathbb{P}} R(C, C). \qquad (17)$$

The adjusted Rand index (ARI) is then given by:

$$\mathrm{ARI} = \frac{R - E_R}{E_{\max} - E_R}. \qquad (18)$$

The ARI equals 0 in expectation for a random partitioning and 1
for identical partitions ($A = C$).

The ARI assumes that annotations and clusterings are drawn
randomly with a fixed number of clusters and a fixed number
of elements per cluster [44]. Although this assumption will not
always be true in our evaluation, we will use ARI, since it is
a more established measure than alternative ones, such as the
Fowlkes-Mallows index, the Mirkin metric, the Jaccard index,
or entropy-based measures, e.g. normalized
mutual information and variation of information.
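
For illustration, the Rand index can be computed by direct pair counting, and the adjustment can be computed from the contingency table of the two labelings. The sketch below uses the widely used permutation-model (Hubert-Arabie) form of the ARI, which fixes the cluster sizes. That choice is an assumption and may differ in detail from the expectation over all partitions used in this paper. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb

def rand_index(a, c):
    """Fraction of element pairs on which two labelings agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (c[i] == c[j]) for i, j in pairs)
    return agree / len(pairs)

def adjusted_rand_index(a, c):
    """Permutation-model ARI computed from the contingency table of a and c."""
    n = len(a)
    same_both = sum(comb(v, 2) for v in Counter(zip(a, c)).values())
    same_a = sum(comb(v, 2) for v in Counter(a).values())
    same_c = sum(comb(v, 2) for v in Counter(c).values())
    expected = same_a * same_c / comb(n, 2)
    maximum = 0.5 * (same_a + same_c)
    if maximum == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (same_both - expected) / (maximum - expected)
```

Both functions take two label sequences of equal length, for example the annotated labels and the symbols produced by the clustering. The actual label values do not matter, only which events share a label.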

In evaluating our system, we use the ARI in two ways: in
the evaluation of 1) the clustering of the feature vectors of
each event (Tables IV and V) and of 2) the prediction of the
entire symbol sequence, as explained in the sequel. According
to Fig. 1, by segmentation, feature extraction, and clustering,
the input sound wave is transformed into a sequence
$c = (c_1, c_2, \ldots, c_T)$ of $T$ events, each one represented as one of
$J$ symbols. All occurrences of symbol $j$ can be included in a
cluster $C_j$ that contains all the indices $t$ where event $c_t$ equals
symbol $j$. Then $C = (C_1, C_2, \ldots, C_J)$ is a partition of
$X = (1, 2, \ldots, T)$. To evaluate the prediction $c$, we assign one
of the ground truth labels $(1, 2, \ldots, I)$ to each segment of the
input, yielding an annotated sequence $a = (a_1, \ldots, a_T)$.
From this, a partition $A = \{A_1, \ldots, A_I\}$ can be generated in
the same way as the partition $C$ for $c$. The number $I$ of the
annotated labels is not necessarily the same as the number of
symbols $J$ determined by the clustering stage of our system.
Then the ARI can be used to compare $C$ and $A$, as done in
Tables I-III and Figs. 4-6.
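
The conversion from a symbol sequence to a partition of the event indices can be sketched in a few lines. The function name `sequence_to_partition` is hypothetical; indices start at 1 to match the notation above.

```python
from collections import defaultdict

def sequence_to_partition(sequence):
    """Group the 1-based event indices by the symbol observed at each index."""
    clusters = defaultdict(list)
    for t, symbol in enumerate(sequence, start=1):
        clusters[symbol].append(t)
    # Clusters are returned in order of first appearance of their symbol.
    return list(clusters.values())

predicted = sequence_to_partition(["x", "y", "x", "y", "x"])
annotated = sequence_to_partition([1, 2, 1, 2, 1])
```

Here the predicted and the annotated sequence use different label names but induce the same partition, which is exactly the case in which a pair-counting measure reports perfect agreement.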


B. Data Sets 

Two sets of test data are employed: 

• Repetitive symbol sequences: We generate sequences that
consist of patterns of length $n_I = 2, \ldots, 5$ made up
of $I$ distinct symbols. These patterns are repeated 20
times. For each pattern length $n_I$ and each partition
$A = \{A_1, A_2, \ldots, A_I\}$ of $\{1, 2, \ldots, n_I\}$, one
sequence is generated so that the elements of each partition
subset $A_i$ are the positions of symbol $i$ within the pattern.
E.g. for $n_I = 5$ and partition
$A = \{A_1, A_2\} = \{\{1, 3, 5\}, \{2, 4\}\}$, symbol '1' occurs at
positions $A_1 = \{1, 3, 5\}$ and symbol '2' occurs at positions
$A_2 = \{2, 4\}$, yielding the pattern ('1','2','1','2','1').

Fig. 4. Learning rate of the two sequence learning algorithms (CB and
HN), depending on the number of pattern repetitions. The ARI
(Section III-A) is given for an increasing number of repetitions of a
pattern with various lengths $n_I = 2, 3, 4, 5, 6$. HN reaches a
perfect ARI quickly, in contrast to CB.

Fig. 5. Robustness of the two sequence learning algorithms (CB, HN,
cf. Fig. 4) with respect to skipping noise. The ARI is given for a
sequence of 20 repetitions of patterns of different lengths $n_I$ and
increasing probability $p_{sk}$ of randomly skipping an event. For
$p_{sk} < 0.4$, HN performs better than CB.
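
The construction of the repetitive test sequences can be sketched as follows. This is an illustration under our own assumptions: the function names are hypothetical, the symbols are numbered by the smallest position at which they occur, and the constant sequence is skipped by default because the evaluation excludes it.

```python
def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for k in range(len(partition)):
            yield partition[:k] + [[first] + partition[k]] + partition[k + 1:]
        # ... or open a new block for it.
        yield [[first]] + partition


def pattern_from_partition(partition, length):
    """Place symbol i (1-based) at the positions listed in block A_i."""
    pattern = [None] * length
    for symbol, block in enumerate(partition, start=1):
        for position in block:
            pattern[position - 1] = symbol
    return pattern


def repetitive_sequences(length, repetitions=20, skip_constant=True):
    """Return one repeated-pattern sequence per partition of {1..length}."""
    sequences = []
    for partition in set_partitions(list(range(1, length + 1))):
        if skip_constant and len(partition) == 1:
            continue  # a single block gives the trivial constant sequence
        # Order the blocks by their smallest position for canonical labels.
        ordered = sorted(partition, key=min)
        sequences.append(pattern_from_partition(ordered, length) * repetitions)
    return sequences
```

For example, `pattern_from_partition([[1, 3, 5], [2, 4]], 5)` reproduces the pattern of the example above, and `repetitive_sequences(5)` enumerates all non-constant patterns of length 5, each repeated 20 times.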

• Audio recordings:

Voice: Informal, low-quality, short voice recordings
of very simplified beat boxing, each consisting of 2-3
different sound categories with different degrees of
tonality, with a simple changing rhythm, a sequence
of a repetitive three-sound pattern, and a ritardando;
altogether 5 recordings of 10-13 s duration each. In
order to demonstrate the unsupervised character of our
system, we choose sounds that do not belong to a
predefined category (e.g. an acoustical instrument).
ENST Drums: Formal, high-quality, automatically
annotated recorded drum sequences. 5 segments described
in terms of style, complexity and tempo as
disco (simple slow, complex medium), rock (simple
fast), country (simple slow, complex medium) [].
The audio recordings are annotated, so they can be
evaluated. Audio data is available on the website [27].


C. Results 

The system architecture consists of the processing chain:
1) onset detection and feature extraction, 2) clustering, 3)
expectation. We will evaluate stages 1), 2), 3) in isolation,
1) + 2) together (referred to as transcription) and the entire
chain 1) + 2) + 3) together (referred to as prediction). We use
the repetitive symbol sequences in order to assess expectation,
i.e. learning rate and noise robustness of the sequence analysis.
The audio recordings are used to test the processing stages of
the entire system separately.


1) Learning Rate and Noise Robustness with Repetitive
Symbol Sequences: We assess the learning rate and noise
robustness of the sequence analysis stage (CB and HN). The
sequence learning algorithm receives an initial chunk of a
repetitive symbol sequence (Section III-B) as input. From this,
the algorithm determines the most probable (expected) next
$4 n_I$ symbols, i.e. it outputs the expected symbols
$c_{t+1}, \ldots, c_{t+4 n_I}$, given the annotated symbols
$a_1, a_2, \ldots, a_t$. For each $t$, a partition $C$ is generated
from the predictions $c_{t+1}, c_{t+2}, \ldots, c_{t+4 n_I}$ and
compared with the corresponding partition $A$ based on the
annotations, as explained in Section III-A. At first, all
annotations so far are used for prediction; later, only the last 12
annotations $a_{t-11}, a_{t-10}, \ldots, a_t$ are used. For the
stochastic CB, the ARI is averaged over 100 runs of all partitions
of a given length $n_I$. The trivial sequence that consists of a
constant repetition of the same symbol is not considered. First we
assess how the learning rate scales with pattern length and number
of pattern repetitions. Fig. 4 shows the ARI averaged across all
partitions of lengths $n_I = 2, \ldots, 5$. For this test, the HN
is set to a maximum N-gram length of $N = 5$. The HN reaches
perfect prediction (ARI=1) after $2 n_I$ events (2 pattern
repetitions). CB seems to converge much more slowly than HN, for
$n_I = 2$ reaching an ARI higher than 0.8 after 8 events, then
increasing much more slowly. For higher $n_I$, CB seems to converge
towards perfect prediction even more slowly.

Different types of noise are used to transform the sequence 
in order to assess the robustness of the sequence learning 
techniques: 

Skipping noise: In the original sequence, a symbol is skipped
with a given probability $0 < p_{sk} < 0.95$.

Switching noise: In the original sequence, with a given
probability of $0 < p_{sw} < 0.95$, a symbol is selected randomly
with uniform distribution across the $n_I$ alternative
symbols.
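
The two noise types can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes that each symbol is treated independently and that, under switching noise, the replacement is drawn uniformly from the whole alphabet, so it may coincide with the original symbol.

```python
import random


def skipping_noise(sequence, p_sk, rng):
    """Drop each symbol independently with probability p_sk."""
    return [s for s in sequence if rng.random() >= p_sk]


def switching_noise(sequence, p_sw, alphabet, rng):
    """With probability p_sw, replace a symbol by a uniform draw from `alphabet`."""
    return [rng.choice(alphabet) if rng.random() < p_sw else s for s in sequence]


# Usage: a seeded generator keeps individual runs reproducible.
rng = random.Random(0)
pattern = [1, 2, 1, 2, 1] * 20
shortened = skipping_noise(pattern, 0.2, rng)
corrupted = switching_noise(pattern, 0.2, [1, 2], rng)
```

Skipping noise shortens the sequence and shifts all later events, whereas switching noise keeps the length and alters only individual symbols.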

The average ARI is calculated over 100 runs for $n_I = 2, 3$,
over 50 runs for $n_I = 4$ and 20 runs for $n_I = 5$ for both
CB and HN. Fig. 5 shows how the prediction performance
(ARI) is affected by skipping symbols with a defined $p_{sk}$
in the repetitive symbol sequences. This simulates e.g. the
failure of the onset extraction algorithm to detect an event. The
prediction is performed given a sequence of 20 repetitions of
the basic pattern. For HN and CB, the performance degrades
until $p_{sk} = 0.5$, where random guess level is reached (ARI=0).
Until $p_{sk} = 0.4$, HN appears to be more robust towards
skipping noise than CB, with CB having a worse ARI for
higher $n_I$.

Fig. 6. Robustness of CB and HN (cf. Fig. 4) with respect to
switching noise. The ARI is given for an increasing probability
$p_{sw}$ of randomly switching a symbol. For small $p_{sw}$, HN
performs better than CB, reaching random guess level (ARI=0) for
$p_{sw} = 0.5$.

In Fig. 6 the effect of clustering errors on the sequence
learning process is simulated. With increasing switching
probability $p_{sw}$, a symbol is replaced by any of the $n_I$
symbols under uniform distribution. The graph shows the prediction
performance using the ARI for CB and HN for different
pattern lengths. The results are similar to those for skipping
noise (Fig. 5). HN is more robust w.r.t. noise than CB, reaching
random guess level (ARI=0) for $p_{sw} = 0.5$. In summary, for
relatively small noise the HN appears to be more robust to
skipping and switching noise, especially for longer pattern
lengths.

2) Testing of Processing Stages with Audio Recordings:
The tests with the Voice recordings (Section III-B) serve as
a proof of concept of clustering with dynamically varying
numbers of clusters. The ENST recordings are used for a more
comprehensive quantitative evaluation of the system. We test
each processing stage separately. For the audio, the sample rate is
$f_s = 44100$ Hz. For segmentation and feature extraction, the
hop size is 128 samples, and the window size is 1024 samples.

a) Onset detection: For the evaluation of the onset
detection (Section II-A) we employ a widely used procedure
[12], [23]. Onset times manually annotated by subjects serve
as references. The onsets estimated by the onset detection
algorithm are then compared to the manually annotated onsets.
Annotated and estimated onsets are considered a match when
their difference in time is smaller than a given threshold. In
our evaluation, we use an onset match threshold of ~50 ms.
Since the data is assumed to be monophonic, the evaluation
only permits a one-to-one mapping between estimated and
annotated onsets.
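
The matching of estimated and annotated onsets can be sketched as follows. This is a simplified illustration with hypothetical function names, not the cited evaluation procedure itself: it assumes onset times in seconds, a strict "smaller than" tolerance, and a greedy left-to-right one-to-one matching of the sorted onset lists.

```python
def match_onsets(estimated, annotated, tolerance=0.05):
    """Count one-to-one matches between sorted onset times (in seconds)."""
    estimated, annotated = sorted(estimated), sorted(annotated)
    i = j = matches = 0
    while i < len(estimated) and j < len(annotated):
        diff = estimated[i] - annotated[j]
        if abs(diff) < tolerance:
            matches += 1  # each onset is consumed by at most one match
            i += 1
            j += 1
        elif diff < 0:
            i += 1  # estimated onset lies before any remaining annotation
        else:
            j += 1  # annotated onset has no estimate within the tolerance
    return matches


def f_measure(estimated, annotated, tolerance=0.05):
    """Harmonic mean of precision and recall of the matched onsets."""
    matches = match_onsets(estimated, annotated, tolerance)
    if matches == 0:
        return 0.0
    precision = matches / len(estimated)
    recall = matches / len(annotated)
    return 2 * precision * recall / (precision + recall)
```

In this sketch, an estimated onset that matches no annotation lowers the precision, and an annotated onset without an estimate lowers the recall.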

Using the following onset detection parameters: smoothing
length $M = 33$ in Eq. 2, sensitivity $C = 0.9$ and look-ahead
window length $P = 10$ in Eq. 3, threshold window
length $W = 11$ in Eq. 4, and silence threshold $\theta_s = 0.002$ in
Eq. 5, onset detection yields an F-measure of ~99% for the
Voice data set. Therefore, we focus on the clustering and the
prediction stage. We also notice that for smoothing lengths
$M > 33$ the system does not improve significantly. Large
smoothing lengths reduce the temporal precision of onsets,
which is important for good feature extraction, since most of
the information about an event is located in the attack.

b) Clustering: We now compare the performance of
our incremental online Cobweb clustering with offline (batch)
HDP-HMM clustering, which uses a constant yet inferred cluster
number, as a benchmark. In order to assess the clustering process
in isolation, we assume error-free onset detection in the previous
stage; to achieve this, we use the annotated onsets as input. In
order to assess the stability of the system, we perform a grid
search on the two most sensitive parameters involved in the
task and the algorithm. For Cobweb, we explore the analysis
window length $L$ (Section II-B) and the acuity $a$ (Eq. 6).
On the parameter grid, the window length/acuity pair with
maximal ARI is determined, extending the parameter grid if
the maximum lies on the grid border, with empirically set
constant grid step sizes.
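
The grid search with border extension can be sketched as follows. This is an illustration under our own assumptions: `score` is a hypothetical stand-in for running the clustering with a given window length and acuity and returning the ARI, the grid is extended by one step at a time, and parameters are kept positive.

```python
def grid_search_with_extension(score, l_values, a_values, l_step, a_step,
                               max_rounds=10):
    """Find the (L, a) pair with maximal score, growing the grid at borders."""
    l_values, a_values = sorted(l_values), sorted(a_values)
    cache = {}  # (L, a) -> score, so that no pair is evaluated twice
    best = None
    for _ in range(max_rounds):
        for l in l_values:
            for a in a_values:
                if (l, a) not in cache:
                    cache[(l, a)] = score(l, a)
        best = max(cache, key=cache.get)
        extended = False
        # Extend the grid by one step wherever the maximum sits on a border.
        if best[0] == l_values[0] and l_values[0] - l_step > 0:
            l_values.insert(0, l_values[0] - l_step)
            extended = True
        elif best[0] == l_values[-1]:
            l_values.append(l_values[-1] + l_step)
            extended = True
        if best[1] == a_values[0] and a_values[0] - a_step > 0:
            a_values.insert(0, a_values[0] - a_step)
            extended = True
        elif best[1] == a_values[-1]:
            a_values.append(a_values[-1] + a_step)
            extended = True
        if not extended:
            break
    return best, cache[best]
```

The cap `max_rounds` is an added safeguard for score surfaces that keep improving towards the border.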

For Voice, Cobweb performance peaks at ARI=82.7% for
$L = 150$ ms, $a = 18.5$ on a parameter grid over
$L = 50, 75, \ldots, 175$ ms; $a = 15, 15.5, \ldots, 19$. For ENST,
Cobweb performs best at ARI=85.7% for $L = 50$ ms, $a = 13.5$ on a
parameter grid over $L = 25, 50, \ldots, 100$ ms;
$a = 13, 13.5, \ldots, 15$ (cf. Tables IV and V in the
supplementary material [27]). This means
that the timbre model and clustering process can successfully
classify the audio events. We also notice that Voice needs
a much longer analysis window than ENST. This test, as
explained above, was performed using the annotated onsets.
The results could change when the onsets are estimated. This
effect is evaluated in the transcription test (Section III-C2c).

For the HDP-HMM, we first reduce the feature vectors of
the input to $D$ dimensions by means of a PCA on the full
sequence. The observation distributions used are Gaussian with
parameters sampled i.i.d. from a normal inverse Wishart prior
[19] with parameters $\mu_0 = 0$, $\kappa_0 = 0.4$,
$\Lambda_0 = 0.001$, $\nu_0 = D + 20$ (cf. Murphy [33], Section 9.2,
p. 20, for the meaning of the parameters). The maximum number of
states of the weak limit approximation inference is set to 10 and
the number of Gibbs sampling iterations to 100. For Voice, HDP-HMM
performance peaks at ARI=99.1% for $\gamma = 8.0$, $\alpha = 7.0$,
$D = 2$ on a parameter grid over
$\gamma, \alpha = 4.0, 5.0, \ldots, 11.0$ and $D = 2, 3$.
For ENST, HDP-HMM performs best at ARI=84.0% for
HDP concentration parameters $\gamma = 6.0$, $\alpha = 12.0$ [] and
$D = 2$ on a parameter grid over
$\gamma, \alpha = 4.0, 5.0, \ldots, 13.0$ and $D = 2, 3$. The
benchmark HDP-HMM performs better than Cobweb for Voice,
whereas for ENST, Cobweb performs 1.7% better than HDP-HMM.
When comparing these results one has to keep in mind that HDP-HMM
clustering has jointly learned offline a stable cluster number and
the transition probabilities, exploiting sequential information,
whereas Cobweb has been trained online with an adaptive cluster
number.

TABLE I
Expectation of Voice (left) and ENST (right): ARI (in %) for
different maximum lengths N of CB/HN (rows).

         Voice                ENST
  N     CB     HN      N     CB     HN
  2    7.4   22.4      2    6.0   18.9
  3    6.8   27.3      3    9.1   28.9
  4    7.3   41.1      4    7.8   43.2
  5    5.1   50.9      5    6.3   42.7
  6    4.4   50.9      6    6.6   42.7
  7    5.1   50.9      7    7.7   42.6

TABLE II
Full Prediction for the Voice data set using HN: ARI (in %) for
different temporal acuities a_t from Eq. 6 (rows) and timbral
acuities a from Eq. 6 (columns).

  a_t \ a     17    17.5     18    18.5     19    19.5     20
  0.05       16.7   15.5   16.3   16.0   25.2   23.8   23.8
  0.0625     15.9   15.3   16.2   16.0   25.2   24.0   24.0
  0.075      14.4   14.3   15.2   16.7   27.2   25.8   25.8
  0.0875     15.7   16.4   17.2   17.3   25.8   24.5   24.5

c) Transcription: The transcription test evaluates the
subsystem composed of onset detection, feature extraction, and
clustering. In contrast to the expectation test, the entire symbol
(inter-onset interval) sequence $c_1, c_2, \ldots, c_t$ extracted
from the clustering stage is always used from the beginning to
predict the next symbol (inter-onset interval) $c_{t+1}$. The
annotations $a$ are not used for prediction, only for evaluation.
The partitions generated from the detected events are compared with
the partitions generated from the annotated labels using the ARI.
Online learning Cobweb with dynamically changing cluster
numbers and offline learning HDP-HMM with a constant
cluster number are compared. For Cobweb, Voice yields
ARI = 81.3% for $L = 150$ ms, $a = 17$. On the ENST data
set, Cobweb yields ARI = 76.3% for $L = 50$ ms, $a = 13.5$, using
the same parameter grids as for the clustering (Section III-C2b). In
comparison to the results for clustering, the ARI degrades
slightly, in particular for ENST, due to wrongly estimated
onsets (cf. Tables VI and VII in the supplementary material
[27]). HDP-HMM transcription performance for Voice peaks at
ARI=98.8% for $\gamma = 8.0$, $\alpha = 12.0$ on a parameter grid
over $\gamma, \alpha = 4.0, 5.0, \ldots, 13.0$. For ENST, HDP-HMM
transcription performance peaks at ARI=76.2% for $\gamma = 5.0$,
$\alpha = 8.0$ on the same parameter grid as for Voice. For ENST,
HDP-HMM and Cobweb are almost equal. Although for Voice the ARI is
much higher for the HDP-HMM benchmark than for Cobweb,
we have to keep in mind that Cobweb learns online with
cluster numbers changing over time, whereas HDP-HMM is
trained offline with a constant number of clusters.

d) Expectation: The expectation test evaluates the
performance of the sequence learning module on the data sets.
We predict the cluster label $c_{t+1}$ of event $t + 1$ based on the
annotations from the start: $a_1, a_2, \ldots, a_t$. Results in
Table I show that for the prediction of the sequences of the Voice
and the ENST data set, HN (e.g. ARI = 43.2% for $N = 4$ on ENST)
works a lot better than CB, which yields an ARI of 7.8% in the same
setting, just slightly better than random (0%). CB's low
performance can be attributed to various factors: In general, many
traditional recurrent networks are known to have a slow learning
rate [18]. In particular, we have observed a slow learning rate
(Fig. 4) and low noise robustness (Fig. 5 and Fig. 6). Whereas for
HN, the updates in frequency counts are getting smaller relative
to the count so far (from 1 to 2 is a higher step relative to
1 than from 100 to 101 relative to 100), in CB the weights
are updated with a constant learning rate p (Eq. 14). Weight
updates are performed by stochastic gradient descent where
each instance is only used once when it has just occurred.
Although this is cognitively plausible if we assume that only
a limited number of instances can be stored by the cognitive
system, it comes at the price of diminished learning speed,
compared to a system where the update is performed using
a batch of instances. In addition, the architecture of the CB
may be suboptimal w.r.t. the hidden nodes. Also, in the
network, new hidden nodes are only generated one at a time,
further limiting learning speed. Furthermore, the parameters
$\theta_k$ for creating new hidden nodes are chosen heuristically
and may be suboptimal for short sequences like the ones
presented. We can also see in Table I that the result does not
improve for maximum N-gram lengths beyond $N = 5$ for Voice
($N = 4$ for ENST). For linguistic data, slower convergence and
worse performance of CB relative to HN is also observed in Pfleger
[37], pp. 80 and 133. In the sequel, we will only use HN.

TABLE III
Full Prediction for the ENST data set using HN: ARI (in %) for
different temporal acuities a_t from Eq. 6 (rows) and timbral
acuities a from Eq. 6 (columns).

  a_t \ a     18    18.5     19    19.5     20    20.5     21
  0.075      33.1   33.8   33.0   32.6   36.2   34.7   33.3
  0.0875     35.1   36.2   35.3   34.2   37.8   36.3   34.9
  0.1        36.3   37.6   36.6   35.4   39.2   37.6   36.2
  0.1125     35.9   37.2   36.2   35.0   38.8   37.2   35.8
  0.125      34.6   35.9   34.9   33.7   37.5   35.9   34.5
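
The count-based prediction discussed in the expectation test can be illustrated with a plain back-off N-gram predictor. Note that this is a deliberately simplified stand-in and not the hierarchical N-gram (HN) evaluated here; the class name and the back-off rule (use the longest context that has been seen before) are our own assumptions.

```python
from collections import Counter, defaultdict


class BackoffNGram:
    """Online next-symbol predictor based on frequency counts.

    Contexts of length 0 .. max_n - 1 are counted; prediction backs off
    from the longest previously seen context to shorter ones.
    """

    def __init__(self, max_n=5):
        self.max_n = max_n
        self.counts = defaultdict(Counter)  # context tuple -> next-symbol counts
        self.history = []

    def _context(self, k):
        # Last k symbols; slicing from len - k avoids the history[-0:] pitfall.
        return tuple(self.history[len(self.history) - k:])

    def update(self, symbol):
        """Count `symbol` as continuation of every context length available."""
        for k in range(min(self.max_n - 1, len(self.history)) + 1):
            self.counts[self._context(k)][symbol] += 1
        self.history.append(symbol)

    def predict(self):
        """Return the most frequent continuation of the longest known context."""
        for k in range(min(self.max_n - 1, len(self.history)), -1, -1):
            continuations = self.counts.get(self._context(k))
            if continuations:
                return continuations.most_common(1)[0][0]
        return None  # nothing observed yet
```

Since each observation simply increments a count, the relative size of an update shrinks as the count grows, which is the behaviour contrasted above with a constant learning rate.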

e) Prediction: The prediction task consists in running the
full system including HN as the sequence analyzer (Tables II
and III). After the transcription of the events
$c_1, \ldots, c_t$, the system predicts the next symbol $c_{t+1}$
and its timing (the next IOI). For evaluating the match between
predicted and annotated onsets, we set the tolerance threshold to
150 ms. For the best configuration, the full prediction yields an
ARI of 27.2% for Voice and an ARI of 39.2% for ENST. The
performance is limited by the weakest performance of its
components, in this case the sequence analysis.


D. Examples 

In this section, we present a few examples (audio on the
website [27]) of transcription and prediction using HN in order
to demonstrate the performance, evolution and shortcomings
of the system. From Hazan et al. [], we adopt the procedure
to optimally map the annotated symbols to the clusters found
by the clustering algorithm. We calculate the matching matrix
between the annotations ('score') and clusters of each event.
In this matching matrix, we then iteratively select the maximal
entry, thereby establishing a connection between a row
(annotations) and a column (clusters). After eliminating the row
and column of the maximal entry, we determine the maximal
entry again until the matrix vanishes.
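
The iterative selection of maximal entries can be sketched as follows. This is an illustration with a hypothetical function name; it assumes that the matching matrix holds co-occurrence counts of annotated labels and clusters, that ties are broken arbitrarily, and that pairs which never co-occur are left unmapped.

```python
from collections import Counter


def greedy_label_mapping(annotations, clusters):
    """Map annotated labels to clusters via repeated maximal matrix entries."""
    # Sparse matching matrix: (annotation, cluster) -> number of shared events.
    counts = Counter(zip(annotations, clusters))
    mapping = {}
    used_clusters = set()
    # Visiting the entries in descending order and skipping rows and columns
    # that are already taken is equivalent to deleting the row and column of
    # each selected maximum and searching the reduced matrix again.
    for (label, cluster), _ in sorted(counts.items(), key=lambda kv: -kv[1]):
        if label in mapping or cluster in used_clusters:
            continue
        mapping[label] = cluster
        used_clusters.add(cluster)
    return mapping
```

For instance, if the first three events of a ta-tschi-bum sequence fall into the wrong clusters and all later events into consistent ones, the later, more frequent co-occurrences determine the mapping.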



Fig. 7. The system (with HN) quickly captures a simple ta-tschi-bum
pattern. Time in seconds (horizontal axis) is mapped versus event
labels (one line each for 'ta', 'tschi', and 'bum'). Annotated labels
are indicated in black below the lines. Above the horizontal lines we
find events that are correctly estimated ('•'), matched to the wrong
cluster ('■'), and unmatched ('▲') due to a wrongly estimated onset.


In Figs. 7 and 8 we display sequences of annotations and
clusters on the same line if they are linked through this
mapping. In Fig. 7 a simple ta-tschi-bum pattern is quickly
captured. We can see how the first three events annotated
as ta, tschi, and bum are matched with the wrong clusters
bum (initial blue triangle above top line), ta, and tschi (red
squares). The first three cluster mismatches are expected, since
the system has no previous knowledge of the symbol space
nor of the sequence and therefore can predict neither symbols nor
patterns that have not yet occurred. At around 5.5 s, an event
annotated as tschi is matched with the ta cluster. The timing of
the last bum is misestimated, and for the last tschi both timing
and cluster matching are wrong. The time deviation errors are due
to the fact that the recorded voice does not follow a temporally
regular pattern.

In Fig. 8 we observe how the system adapts to pattern
changes within the sequence. For the first two events, the
cluster matching is wrong. Then, after having processed enough
sounds, the system performs correct predictions. In the middle,
around 7.5 s, when the repetition pattern of ta is introduced,
the onset is wrongly estimated for three ta events, two of which
are also matched with the wrong cluster, and one additional event
is only matched with the wrong cluster. The errors in the middle
of the sequence are due to the pattern change. The N-gram is able
to update the statistics and perform correct predictions after
three occurrences of the new pattern.

Fig. 8. The system (with HN) adapts to a pattern change from ta-bong
to ta-ta-bong (cf. Fig. 7).

Fig. 9. Cluster merging: After 38 events, two clusters ('•', '■')
merge into one cluster ('■'). The projection of the MFCC vectors
(timbre representation) onto their first two principal components
(above) and the incremental clustering tree (below) are shown.

Fig. 10. Creation of new clusters: After 20 and 80 sound events, new
clusters ('▲') emerge on the fly. Cf. Fig. 9.

In Fig. 9 (sound and video on the website [27]), a sequence
of alternating bass drum and hi-hat samples is played. During
the sequence, the hi-hat is gradually mixed in a linear fashion
with an increasing amount of bass drum and vice versa, so
that in the end both hi-hat and bass drum are mixed together
in a balanced way, yielding a repetitive sequence of similar
sounds. The system recognizes the two sound clusters in the
beginning, and finally merges the two clusters into one single
cluster.

In Fig. 10 (sound and video on the website [27]), a sequence
of sound events is analyzed that starts with one sound, later
joined by a second and third sound. The system is able to split
the initial cluster gradually into 2 and 3 clusters.

IV. Conclusion and Perspectives 

We have presented a full system that predicts the next
sound event from the previous events, operating on audio
data. Using no prior knowledge, neither of the sounds or
instruments used nor of the timing and rhythmical structure of
the audio segment, the system starts from a tabula rasa,
performing predictions from the very first
sound event. The system adapts to pattern changes in the
sequence as well as to the appearance of new sounds or
instruments at any time. Currently the system is limited by
the lack of metrical analysis, making it especially sensitive
to missed onsets. Considering the metrical context could
significantly improve the quality of predictions. For this goal, a
metrical alignment procedure [24], [25] could be combined with
incremental learning. As alternatives to CB and HN, variable
length Markov models [], [25] or other deep learning
architectures can be used, thereby overcoming the context
length limitation of HN and CB and the slow learning of CB.
The long short-term memory (LSTM) [18] is a recurrent neural
network that was developed to capture dependencies
between disconnected distant chunks within the same time
series. Crucial to this, and for speeding up learning, LSTM uses
special memory cells. Access to these cells can
be opened and closed by special gating units. Successfully
applied to protein homology detection, automatic composition,
handwriting and spoken language recognition, LSTM
could be used to replace CB or HN and improve learning
speed in our application. HDP-HMM [43] could be adapted
to online learning with incremental addition/removal of
clusters, comprising also segmentation, thereby replacing onset
detection. The presented system can also be modified to learn
melodies and chord progressions. For learning melodies, in
the feature extraction stage (Section II-B), MFCCs need to
be replaced by a pitch detection method as used in Marxer
et al. [28] for learning songs by the Mbenzele pygmies
or as in Cherla et al. [13] for learning guitar riffs. When
analysing (piano) chord progressions, MFCCs can be replaced
by constant Q profiles [20], [39]. Future work includes the
development of a better representation of pitch and harmony,
using a larger training set when processing more complex
music.

Inspired by these ideas, we imagine a musical improvisational
dialogue between a human and a machine in which the
human may spontaneously articulate novel ideas such as new
sounds, motifs, rhythms, or harmonies. A dumb and ignorant
machine would dampen and finally stop the musical flow. But
if the machine could take up the novel idea, reply to it, varying
the suggestions of its human partner, they could develop an
enhanced musical conversation.

V. Supplement Results: Grid Search on Parameters

TABLE IV
Cobweb clustering of Voice data: ARI (in %) for different analysis
window lengths L in ms (Section II-B, rows) and timbral acuities a
from Eq. 6 (columns).

  L \ a     15    15.5     16    16.5     17    17.5     18    18.5     19
   50      31.0   33.0   33.0   33.0   33.0   34.3   35.3   35.3   35.3
   75      62.1   57.8   42.5   42.2   46.2   46.2   34.6   34.6   36.2
  100      76.5   77.6   75.8   78.6   81.0   78.9   55.7   55.7   37.9
  125      79.8   79.8   71.5   73.3   73.6   73.7   78.0   78.0   80.6
  150      67.5   66.6   67.7   68.6   73.3   76.2   81.2   82.7   78.6
  175      60.0   65.6   67.2   67.4   70.1   73.5   74.2   75.5   72.4


TABLE V
Cobweb clustering of ENST data: ARI (in %) for different analysis
window lengths L in ms (rows) and timbral acuities a from Eq. 6
(columns).

  L \ a     13    13.5     14    14.5     15
   25      65.6   64.7   63.9   63.9   62.8
   50      83.0   85.7   80.5   80.2   76.7
   75      73.9   72.5   69.6   78.3   65.8
  100      74.2   69.1   67.9   66.3   68.8


TABLE VI
Onset and Cobweb transcription: ARI (in %) for different analysis
window lengths L in ms (rows) versus acuity a from Eq. 6 for timbre
clustering (columns), measured on the Voice data set.

  L \ a     15    15.5     16    16.5     17    17.5     18    18.5     19
   50      29.8   29.2   28.9   30.2   37.4   37.4   37.4   37.4   37.4
   75      56.3   62.5   57.1   43.8   30.7   29.3   29.3   28.5   35.2
  100      70.5   71.7   77.9   76.8   77.7   67.0   49.4   36.9   35.6
  125      65.1   68.3   72.7   72.7   73.0   73.4   73.4   75.6   63.4
  150      69.0   69.7   71.8   79.4   81.3   80.7   78.8   77.7   79.0
  175      66.4   67.7   68.4   69.8   70.8   73.4   73.4   74.6   75.6


TABLE VII
Onset and Cobweb transcription: ARI (in %) for different analysis
window lengths L in ms (rows) versus acuity a from Eq. 6 for timbre
clustering (columns), measured on the ENST data set.

  L \ a     13    13.5     14    14.5     15
   25      65.9   67.1   67.1   67.1   67.0
   50      67.5   76.3   71.4   71.4   71.4
   75      63.2   61.4   60.6   70.8   70.2
  100      62.4   57.2   57.9   58.1   61.6




















